This notebook details my approach to building predictive models for newly released games on boardgamegeek.com (BGG). Specifically, I am interested in taking any newly released boardgame and, using features that are available at the time of its release, estimating how it will be received on BGG: its average rating, number of user ratings, and complexity rating.
While the goal of this project is ultimately to yield accurate predictions for upcoming games, we are also interested in understanding what the models learn. What features of games are associated with high/low average rating? Why do some games receive high numbers of user ratings? What types of games are the most complex?
To answer these questions, we’ll make use of historical data from boardgamegeek. We will connect to a database on GCP containing a variety of tables on game features and their current ratings on BGG. When training models, we will restrict ourselves to games published through 2019. We will then validate the models by evaluating how well they predict games published in 2020.
The data we are using comes from boardgamegeek.com, which we access through the open BGG API. We are training models on data that was last pulled from BGG on 2022-02-21.
We will be training models at the game-level, where every row corresponds to one game and every column corresponds to a feature of the game. As of this date, there are 22,071 games on boardgamegeek with at least 30 user ratings.
For the purpose of this analysis, the training set will only include games published before 2020 that have achieved at least 100 user ratings. This is a design decision that 1) restricts our sample to games that have received some evaluation from the community and 2) reduces the time spent training models. We can later treat this cutoff as a tuning parameter, allowing more or fewer historical games to enter the model for training. Based on some initial tests, 100 was a useful cutoff point for both model performance and training time.
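The split and cutoff described above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the field names `yearpublished` and `usersrated` and the toy records are assumptions, and leaving the validation set unfiltered by rating count is also an assumption.

```python
# Hypothetical records; field names and values are illustrative assumptions.
games = [
    {"name": "A", "yearpublished": 2015, "usersrated": 2500},
    {"name": "B", "yearpublished": 2019, "usersrated": 45},
    {"name": "C", "yearpublished": 2020, "usersrated": 310},
    {"name": "D", "yearpublished": 2021, "usersrated": 120},
]

TRAIN_END_YEAR = 2019  # train on games published through this year
VALID_YEAR = 2020      # validate on games published in this year
MIN_RATINGS = 100      # tunable cutoff for entering the training set


def split_games(games, train_end_year=TRAIN_END_YEAR,
                valid_year=VALID_YEAR, min_ratings=MIN_RATINGS):
    """Return (train, valid) lists of game records."""
    train = [
        g for g in games
        if g["yearpublished"] <= train_end_year
        and g["usersrated"] >= min_ratings
    ]
    # Assumption: the validation year is not filtered by rating count.
    valid = [g for g in games if g["yearpublished"] == valid_year]
    return train, valid


train, valid = split_games(games)
```

Keeping `min_ratings` as an argument makes it straightforward to treat the cutoff as a tuning parameter later.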
We are interested in modeling a number of different outcomes: a game’s average rating, complexity rating, and number of user ratings.
These outcomes aren’t independent, as complexity and the average rating are highly correlated.
As we will see, this means that if we want to predict a game’s average rating, the most important feature is usually its average weight (BGG’s complexity rating). But because a game’s average rating and complexity are both voted on by the BGG community, we won’t know a game’s average weight at the time of its release. This means that for upcoming games, we will first use a model to estimate a game’s complexity and then use that estimate as an input to our average rating model.
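The two-stage idea can be sketched with a deliberately simple stand-in model. Everything here is an assumption for illustration: the one-variable least-squares fit, the use of playing time as the only release-time feature, the made-up numbers, and the choice to train the rating model on observed weights.

```python
def fit_simple_ols(x, y):
    """Fit y = a + b * x by ordinary least squares; return (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(model, x):
    """Apply a fitted (a, b) model to a list of inputs."""
    a, b = model
    return [a + b * xi for xi in x]


# Toy training data (made up): playing time in minutes,
# community-voted average weight, and average rating.
playing_time = [20, 45, 60, 90, 120, 180]
avg_weight = [1.2, 1.8, 2.1, 2.7, 3.1, 3.9]
avg_rating = [6.1, 6.5, 6.8, 7.1, 7.4, 7.9]

# Stage 1: estimate complexity from a feature known at release.
weight_model = fit_simple_ols(playing_time, avg_weight)

# Stage 2: model average rating as a function of complexity.
rating_model = fit_simple_ols(avg_weight, avg_rating)

# For a new game the weight is unknown, so the stage 1 estimate
# is passed into the stage 2 model in its place.
new_playing_time = [75]
est_weight = predict(weight_model, new_playing_time)
est_rating = predict(rating_model, est_weight)
```

One consequence of this design is that errors in the complexity estimate carry through to the rating estimate, so the rating model will look less accurate on new games than it does when given observed weights.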
What features do we have about games? We have basic information about every game, such as its player count and playing time, and we also have many BGG outcomes, such as the number of comments and the number of people trading, which we will not use in predicting the outcomes we care about. We have some missingness in the playing time variables that we will address in the recipe that prepares the data.
We also have a variety of information about game mechanics, categories, publishers, designers, artists, and so on. Some of these types are not observed for every game; for instance, a game may not have expansions or integrations with other games.
As the table below shows, there are ~180 different mechanics, ~7k publishers, ~10k designers, and ~11k artists present in our training set. This is good in the sense that we have ample information about games for models to look at and use in training, but bad in the sense that if we threw all of it into a model we would quickly run up against the curse of dimensionality.
| type           | n_types | n_games |
|----------------|---------|---------|
| publisher      | 7,177   | 22,070  |
| category       | 84      | 21,789  |
| designer       | 9,978   | 21,445  |
| mechanic       | 182     | 20,456  |
| family         | 3,566   | 18,392  |
| artist         | 11,128  | 15,995  |
| expansion      | 23,528  | 5,604   |
| implementation | 5,622   | 5,000   |
| integration    | 2,415   | 1,770   |
| compilation    | 671     | 863     |
How can we make use of this information for modeling? We could create dummy variables for every different type, but this will quickly create thousands of features, many of which are going to contain little information. We could view this as a P > N problem and let the data speak for itself via feature selection and dimension reduction.
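To make the dummy-variable option concrete, here is a small sketch of turning a list-valued field into one 0/1 column per level. The record layout, the `mechanics` field name, and the helper `multi_hot` are hypothetical.

```python
def multi_hot(games, key):
    """Return (sorted vocabulary, rows of 0/1 indicators) for a list-valued field."""
    vocab = sorted({level for g in games for level in g[key]})
    index = {level: i for i, level in enumerate(vocab)}
    rows = []
    for g in games:
        row = [0] * len(vocab)
        for level in g[key]:
            row[index[level]] = 1
        rows.append(row)
    return vocab, rows


# Toy records (illustrative only).
games = [
    {"name": "A", "mechanics": ["Deck Building", "Hand Management"]},
    {"name": "B", "mechanics": ["Worker Placement"]},
    {"name": "C", "mechanics": ["Hand Management", "Worker Placement"]},
]

vocab, rows = multi_hot(games, "mechanics")
```

The number of columns equals the number of distinct levels, which is why applying this to every designer, artist, and publisher would produce thousands of mostly empty features.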
Alternatively, if every game had only one mechanic/designer/publisher, we could mean encode on the training set. For instance, instead of using thousands of dummy variables for each designer, we would have one ‘designer_mean’ feature that is simply the mean outcome for that designer’s games in the training set. This can dramatically reduce the dimensionality of categorical features while keeping the information we want.
For our purposes, the hang up with taking a simple mean encoding approach is that a game may have multiple designers, categories, mechanics, artists, and publishers. For designers we might be able to get by with taking the mean of the designer means, but it starts to get more complicated with mechanics - most games have multiple different mechanics, and it’s the combination of different mechanics that we are interested in exploring. The other complication is that some designers have only designed a handful of games, while others have designed hundreds, so the mean may not impart the same amount of information.
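A rough sketch of mean encoding with multiple designers per game follows. The shrinkage toward the global mean is not something described above; it is one common way to handle designers with only a few games, added here as an assumption, and `prior_weight`, the field names, and the toy ratings are all hypothetical.

```python
from collections import defaultdict


def designer_means(train, prior_weight=5.0):
    """Return (shrunken mean rating per designer, global mean)."""
    global_mean = sum(g["rating"] for g in train) / len(train)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g in train:
        for d in g["designers"]:
            totals[d] += g["rating"]
            counts[d] += 1
    # Designers with few games are pulled toward the global mean.
    means = {
        d: (totals[d] + prior_weight * global_mean) / (counts[d] + prior_weight)
        for d in totals
    }
    return means, global_mean


def encode(game, means, global_mean):
    """Mean of the designer means; unseen designers fall back to the global mean."""
    values = [means.get(d, global_mean) for d in game["designers"]]
    return sum(values) / len(values) if values else global_mean


# Toy training set (illustrative only).
train = [
    {"designers": ["X"], "rating": 8.0},
    {"designers": ["X", "Y"], "rating": 7.0},
    {"designers": ["Y"], "rating": 6.0},
    {"designers": ["Z"], "rating": 9.0},
]

means, global_mean = designer_means(train)
new_game_feature = encode({"designers": ["X", "Z"]}, means, global_mean)
```

The means must be computed on the training set only, since computing them on games being predicted would leak the outcome into the feature.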
On top of all of this, we have to be careful in what features we allow to enter a model, as some of the categories about games are themselves a reflection of the outcomes we want to predict.
With all this in mind, we’ll do a bit of inspection to figure out which features of games we’ll allow to enter our training recipe, in essence using a manual filtering method to select features.
One set of features relates to a game’s “family”, which is a catch-all term for various buckets that games might fall into: Kickstarters, dungeon crawls, tableau builders, etc. Some of these are likely to be very useful in training a model, while others should be omitted. We don’t, for instance, want to include whether a game has digital implementations, as these are a reflection of a game’s popularity. These features also have a very long tail, with some families only having one or two games in them. We’ll filter out families with near zero variance, removing those that fall below a minimum share of a little less than 1% of games.
We also won’t include some features, such as Mensa Select or implementations on BoardGameArena, as these typically occur after a game has become popular and shouldn’t be used as predictors.
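The family filter described above can be sketched as a minimum-proportion rule plus a manual exclusion list. The helper `keep_levels`, the family labels, and the threshold used in the toy call are assumptions for illustration, not the values used in the analysis.

```python
from collections import Counter


def keep_levels(games, key, min_prop, exclude=()):
    """Return levels that appear in at least min_prop of games and are not excluded."""
    n_games = len(games)
    # set() so a level is counted at most once per game.
    counts = Counter(level for g in games for level in set(g[key]))
    return sorted(
        level
        for level, count in counts.items()
        if count / n_games >= min_prop and level not in exclude
    )


# Toy records with hypothetical family labels.
games = [
    {"families": ["Kickstarter", "Digital Implementations"]},
    {"families": ["Kickstarter"]},
    {"families": ["Dungeon Crawl", "Digital Implementations"]},
    {"families": ["Kickstarter", "Tableau Building"]},
]

kept = keep_levels(
    games,
    "families",
    min_prop=0.5,  # far higher than ~1%, only because the toy data is tiny
    exclude={"Digital Implementations"},  # reflects popularity, so left out
)
```

Because `min_prop * len(games)` is the minimum number of games a level needs, the same helper with a smaller `min_prop` could express a rule such as "keep designers with roughly 15 or more games".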
We’ll do the same thing for categories, but this variable is much smaller and generally pretty well organized.
We’ll include all of these, though there will likely be some overlap between these and other features which we can take care of with a correlation filter.
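A correlation filter of the kind mentioned above might look like the sketch below. The greedy keep-first rule, the 0.9 threshold, and the feature names are assumptions; the actual recipe step may choose which feature of a correlated pair to drop differently.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def correlation_filter(features, threshold=0.9):
    """Greedily keep features whose |correlation| with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy indicator columns (illustrative only).
features = {
    "category_card_game": [1, 0, 1, 0, 1, 0],
    "family_card_games": [1, 0, 1, 0, 1, 0],  # duplicates the column above
    "category_wargame": [0, 1, 0, 1, 0, 0],
}

kept = correlation_filter(features)
```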
Mechanics are also pretty well organized, so we don’t have to do much filtering.
We’ll just keep all of the mechanics, as these are the main features of games that we’ll focus our attention on.
How should we handle artist and designer effects? We’ll use a much lower minimum proportion here, as very few designers would have designed ~100 games.
This amounts to including a designer once they have released about 15 games. We’ll more or less take the same approach for artists.